---
title: "Credit Scorecard Construction"
author: "GuQuQ"
output:
flexdashboard::flex_dashboard:
social: menu
source_code: embed
---
```{r}
knitr::opts_chunk$set(message = F, warning = F,cache = T)
```
```{r setup, include=FALSE}
library(flexdashboard)
library(knitr)
library(dplyr)
library(ggplot2)
library(caret)
library(ROCR)
library(cvAUC)
library(xgboost)
library(plotly)
library(DT)
library(effects)#effects
library(reshape2)#melt
library(psych)#pairs.panels
library(corrplot)#corrplot
library(randomForest)
library(gbm)
```
1. Data Processing {data-orientation=rows}
=======================================================================
Row {data-height=600}
-----------------------------------------------------------------------
### a. A random sample of 20 records
```{r}
df0<-read.csv("C:\\Users\\guquan\\Desktop\\评分卡制作demo\\cs-training.csv")
set.seed(66)
slct=createDataPartition(df0$SeriousDlqin2yrs,p = 0.2,list = F)
df=df0[slct,]
df=df[,-1]
a=names(df)
b=strsplit(x = 'Whether serious delinquency occurs within the next two years
Ratio of unsecured credit used to total credit limit
Borrower age
Number of times 30-59 days past due
Debt ratio: total debt over monthly income
Monthly income
Number of open credit lines and loans
Number of times 90+ days past due
Number of real estate loans or lines
Number of times 60-89 days past due
Number of dependents',split = '\n')[[1]]
c=c("Dlqin2yrs","Utilization","Age","days30_59","DebtRatio","Income",
    "OpenCredit","days90","RealEstate","days60_89","Dependents")
colnames(df) <- c
datatable(sample_n(df,20))
#str(df)
```
Row {data-height=400}
-----------------------------------------------------------------------
### b. Field descriptions: the data have 11 columns and about 30,000 records; "SeriousDlqin2yrs" is the response variable
```{r}
kable(data.frame(FieldName =a,Description=b,Short=c),
format = "markdown", align = "c")
```
### c. Missing values: concentrated in the Income and Dependents columns
```{r}
library(VIM)
aggr(df,numbers=T)
#matrixplot(df)
```
```{r}
### Outlier and missing-value handling
df$"Utilization"[df$"Utilization">1] <- median(df$"Utilization",na.rm = TRUE)
df$"days30_59"[df$"days30_59">=96]<-0
df<-df[-which(df$DebtRatio>100000),]
df$"Income"[is.na(df$"Income")]<-median(df$"Income",na.rm = TRUE)
df<-df[-which(df$"Income">300000),]
df$"days90"[df$"days90">90]<-0
df<-df[!(df$"RealEstate"==54),]
df$"days60_89"[df$"days60_89">90]<-0
df$"Dependents"[is.na(df$"Dependents")]<-0
```
```{r}
### Class-imbalance handling (undersampling)
#mean(df$"Dlqin2yrs")#0.065
#sum(df$"Dlqin2yrs"==1)
newdf<-df[df$"Dlqin2yrs"==1,]
DownsampleDat<-df[df$"Dlqin2yrs"==0,]
set.seed(77)
downsam<-sample(1:nrow(DownsampleDat),2100)
nDat<-rbind(newdf,DownsampleDat[downsam,])
set.seed(88)
nDat<-nDat[sample(nrow(nDat)),]
nDat$"Dlqin2yrs"<-as.factor(nDat$"Dlqin2yrs")
set.seed(36)
trainIndex <- createDataPartition(nDat$Dlqin2yrs, p = .8, list = FALSE)
ntrain<-nDat[trainIndex,]
ntest<-nDat[-trainIndex,]
```
2. Statistic Analysis
=======================================================================
Column {.tabset}
-----------------------------------------------------------------------
### Correlation matrix
```{r dev = "svg"}
M=cor(nDat[,-1])
corrplot(M, method="pie",tl.cex = 0.5,tl.pos="d",cl.pos = "b")
```
### Scatterplot matrix
```{r}
pairs.panels(nDat[,-1],cex.cor = 0.8, cex = 0.1)
```
### Conditional distributions of each variable
```{r dev = "svg",eval=F}
# melting the data frame
feature.names<-names(nDat)[-1]
vizDat<- melt(nDat,id.vars = 'Dlqin2yrs'
,measure.vars = feature.names, variable.name = "Feature"
,value.name = "Value")
# conditional box plots for each feature on the response variable
p <- ggplot(data = vizDat, aes(x=Feature, y=Value)) + geom_boxplot(aes(fill=Dlqin2yrs),outlier.size = 0.3)
p <- p+ theme(axis.text.x = element_text(color="white"))+ facet_wrap( ~ Feature, scales="free")
p <- p + ggtitle("Conditional Distributions of each variable")
p
```
Column {data-width=300}
-----------------------------------------------------------------------
### Notes
1. Missing-value imputation and outlier removal
- Most variables are imputed with the median
- KNN imputation would be a better choice, but it is far more computationally expensive
- Records with DebtRatio above 100000 or Income above 300000 were removed
2. Handling class imbalance
- The data contain `r sum(df$"Dlqin2yrs"==1)` delinquent records, a proportion of `r mean(df$"Dlqin2yrs")`
- The majority class was undersampled
3. The three charts
- show the correlations among the predictor variables and their relationships with the response
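The KNN imputation mentioned in the notes can be sketched as follows. This is an illustrative, self-contained Python sketch of the idea (not part of the R pipeline, simplified to a median over the k nearest fully observed rows):

```python
import math
from statistics import median

def knn_impute(rows, k=3):
    """Fill None entries with the column median of the k nearest
    fully observed rows (Euclidean distance on the observed columns)."""
    complete = [r for r in rows if None not in r]
    out = []
    for r in rows:
        if None not in r:
            out.append(list(r))
            continue
        obs = [j for j, v in enumerate(r) if v is not None]
        # rank complete rows by distance on the columns this row has observed
        nearest = sorted(
            complete,
            key=lambda c: math.dist([r[j] for j in obs], [c[j] for j in obs]),
        )[:k]
        out.append([
            v if v is not None else median(n[j] for n in nearest)
            for j, v in enumerate(r)
        ])
    return out

rows = [[1.0, 2.0], [1.1, 2.2], [5.0, 9.0], [1.05, None]]
print(knn_impute(rows, k=2))  # the None is filled from the two nearest rows
```

The cost driver is clear from the sketch: every incomplete row is compared against every complete row, which is why the notes call KNN imputation expensive relative to a single median per column.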
3. Logistic Regression
=======================================================================
Column {.tabset}
-----------------------------------------------------------------------
### Initial regression
```{r}
logi.model<-glm(Dlqin2yrs~.,data = ntrain,family = "binomial")
summary(logi.model)
```
### Adding interaction terms and stepwise selection
```{r}
inter.logi<-glm(Dlqin2yrs~(.)^2,data = ntrain,family = "binomial")
step.logi<-step(inter.logi, trace = 0)
aa=as.data.frame(round(summary(step.logi)$coefficients,5))
aa$mark=ifelse(aa$`Pr(>|z|)`>0.06," ",ifelse(aa$`Pr(>|z|)`>0.02,"*","**"))
kable(aa,format = "markdown", align = "c")
```
Column {.tabset}
-----------------------------------------------------------------------
### Effect plots, initial model
```{r dev = "png", fig.width=8,fig.height=7}
plot(allEffects(logi.model),cex=0.2,main=NULL)
```
### Effect plots, stepwise model
```{r dev = "png", fig.align="center", fig.width=8, fig.height=6}
plot(effect("Utilization:days90",step.logi),cex=0.5)
plot(effect("days30_59:days60_89",step.logi),cex=0.5)
plot(effect("Age:OpenCredit",step.logi),cex=0.5)
plot(effect("RealEstate:days60_89",step.logi),cex=0.5)
```
4. Evaluation and Comparing {data-orientation=rows}
=======================================================================
Row
-------------------------------------
```{r}
pred.step.logi<-predict(step.logi,ntest[,-1],type='response')
out.step<-cvAUC(pred.step.logi,ntest$Dlqin2yrs)
```
### a. ROC curve of the logistic regression, AUC = `r out.step$cvAUC`
```{r}
FPR=out.step$perf@x.values[[1]];TPR=out.step$perf@y.values[[1]]
p <- ggplot(data = NULL,mapping = aes(FPR,TPR))+
geom_line(colour="red")+ xlab(out.step$perf@x.name) + ylab(out.step$perf@y.name)
p <- p+geom_line(mapping = aes(x=seq(0,1,by=0.01),y=seq(0,1,by=0.01)),
colour="lightblue",linetype=2)
ggplotly(p)
```
### b. Random forest variable importance
```{r}
#randomForest
set.seed(123)
ml.forest<-randomForest(Dlqin2yrs~.,
data = ntrain,mtry=2,
importance=TRUE,
ntree=1000)
pred.forest<-predict(ml.forest,ntest[,-1],'prob')[,2]
pr.forest <- prediction(pred.forest, ntest$Dlqin2yrs)
prf.forest <- performance(pr.forest, measure = "tpr", x.measure = "fpr")
par(bg="#F0F0F0")
cl= c("#009393","#2894FF","sienna3")
varImpPlot(ml.forest,color=cl,bg=cl,main="RandomForest Var-Importance")
par(bg="white")
```
### Notes
- The logistic regression shown here includes interaction terms, with variables selected by stepwise regression
- Variable importance can guide variable selection and serve as a reference for assigning weights
- The random forest measures importance by both accuracy decrease and the Gini index
- It ranks these variables as most important: **Utilization / Age / days30_59 / days90**
Row
-------------------------------------
### c. XGBoost variable importance
```{r}
#gbm
set.seed(234)
ml.gbm <- gbm.fit(x = ntrain[,-1],y = as.numeric(ntrain[,1])-1,
n.trees = 300,interaction.depth = 3,
var.monotone = c(0,0,1,1,0,0,0,0,0,0),
shrinkage = 0.01, n.minobsinnode = 10,
nTrain =nrow(ntrain)*0.8, verbose = 0)
pred.gbm <- predict(ml.gbm,ntest[,-1],n.trees = 300)
pr.gbm <- prediction(pred.gbm, ntest$Dlqin2yrs)
prf.gbm <- performance(pr.gbm, measure = "tpr", x.measure = "fpr")
#xgboost
set.seed(456)
dtrain <- xgb.DMatrix(data = as.matrix(ntrain[,-1])
, label = as.numeric(ntrain$Dlqin2yrs)-1)
dtest<- xgb.DMatrix(data = as.matrix(ntest[,-1])
, label = as.numeric(ntest$Dlqin2yrs)-1)
ml.bst <- xgb.train(data=dtrain, max.depth=3,
eta=0.01, nthread = 2, nround=100,
eval.metric = "error",
eval.metric = "logloss", verbose = 0,
objective = "binary:logistic")
xgb.imp <- xgb.importance(model = ml.bst)
xgb.imp$Feature=names(ntrain)[-1][as.numeric(xgb.imp$Feature)+1]
#xgb.plot.importance(importance_matrix = xgb.imp)
xgb.imp$Feature=factor(xgb.imp$Feature,levels = xgb.imp$Feature[order(xgb.imp$Gain,decreasing = T)])
p <- ggplot(data = xgb.imp, mapping = aes(x=Feature,y=Gain,fill=Feature))+
geom_bar(stat="identity")+theme(axis.text.x = element_text(angle = 45))
ggplotly(p)
pred.xgb<-predict(ml.bst,dtest)
pr.xgb <- prediction(pred.xgb, ntest$Dlqin2yrs)
prf.xgb <- performance(pr.xgb, measure = "tpr", x.measure = "fpr")
```
### d. Comparison of the four models
```{r}
fpr.rf=prf.forest@x.values[[1]];tpr.rf=prf.forest@y.values[[1]]
fpr.gbm=prf.gbm@x.values[[1]];tpr.gbm=prf.gbm@y.values[[1]]
fpr.xgb=prf.xgb@x.values[[1]];tpr.xgb=prf.xgb@y.values[[1]]
lens=sapply(list(FPR,fpr.rf,fpr.gbm,fpr.xgb),length)
roc <- data.frame(Model=rep(c("LR","RF","GBM","XGB"),lens),
fpr=c(FPR,fpr.rf,fpr.gbm,fpr.xgb),
tpr=c(TPR,tpr.rf,tpr.gbm,tpr.xgb))
#roc curve
p <- ggplot(data = roc, mapping = aes(x=fpr,y=tpr,colour=Model))+
geom_line()+xlab(out.step$perf@x.name) + ylab(out.step$perf@y.name)+
geom_line(mapping = aes(x=seq(0,1,length.out = sum(lens)),
y=seq(0,1,length.out = sum(lens))),
colour="gray",linetype=2)
ggplotly(p)
#auc
allAuc <- data.frame(Model=c("LR","RF","GBM","XGB"),
AUC=c(out.step$cvAUC,
performance(pr.forest, measure = "auc")@y.values[[1]],
performance(pr.gbm, measure = "auc")@y.values[[1]],
performance(pr.xgb, measure = "auc")@y.values[[1]]))
allAuc=allAuc[order(allAuc$AUC,decreasing = T),]
```
### Notes
- Logistic regression is compared here with three (ensemble) tree models
- XGBoost variable importance was also computed; it ranks these variables as most important: **Utilization / days90 / days30_59 / days60_89**
- The four models' AUC values, from high to low, are:
`r kable(allAuc,format = "markdown",align = "c")`
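For reference, the AUC used in this comparison can be computed directly from scores and labels via the rank-sum (Mann-Whitney) identity; a minimal, self-contained Python sketch, independent of the ROCR/cvAUC calls above:

```python
def auc(scores, labels):
    """AUC via the rank-sum (Mann-Whitney) identity: the probability
    that a randomly chosen positive is scored above a random negative."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1                        # group ties on the same score
        avg_rank = (i + 1 + j) / 2.0      # average 1-based rank of the tie group
        rank_sum += avg_rank * sum(lab for _, lab in pairs[i:j])
        i = j
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```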
5. WOE Transformation
=======================================================================
Column {data-width=700}
-----------------------------------------------------------------------
### Variable binning
```{r}
### Binning of continuous variables
library(discretization)
nDat.list <- as.list(nDat[,-1])
cut.points <- lapply(nDat.list,FUN = function(x){
p=cutPoints(x,y=nDat$Dlqin2yrs)# entropy-based (MDLP) cut points
if(length(p)<=2){
p=unique(as.numeric(quantile(x,c(0,1/4,2/4,3/4,1))))# fall back to quartile (equal-frequency) binning
if(length(p)<=4){
return("Category") # handle manually as a categorical variable
} else {
return(p)
}
}
c(min(x),sort(p),max(x))
})
nDat.group=nDat
catgry=c()
for(i in 1:length(cut.points)){
if(cut.points[[i]][1]=="Category") {
catgry=c(catgry,names(cut.points[i]))
next
}
nDat.group[,i+1]=as.integer(cut(nDat.list[[i]],
breaks = cut.points[[i]],
include.lowest = TRUE))
}
### Manual treatment of categorical variables
#sapply(nDat.list[catgry],table)
nDat.group$days30_59[nDat.group$days30_59>6] <- 6
nDat.group$days90[nDat.group$days90>4] <- 4
nDat.group$RealEstate[nDat.group$RealEstate>6] <- 6
nDat.group$days60_89[nDat.group$days60_89>3] <- 3
nDat.group$Dependents[nDat.group$Dependents>5] <- 5
kable(sample_n(nDat.group,20), format = "markdown", align = "c")
```
### WOE transformation
```{r}
library(woe)
woe.all <- sapply(names(nDat.group)[-1], simplify = F, USE.NAMES = T, function(x){
temp = woe(Data = nDat.group, Independent = x, Continuous = F,
Dependent = "Dlqin2yrs", C_Bin = length(unique(nDat.group[,x])),
Bad = 0, Good = 1)[,c("BIN","BAD%","GOOD%")]
temp$`BAD%`=ifelse(temp$`BAD%`<0.001,0.001,temp$`BAD%`)
temp$WOE = log(temp$`GOOD%`/temp$`BAD%`)# delinquent (1) over on-time (0)
temp[,c("BIN","WOE")]
})
#woe.all
nDat.woe = nDat.group
for(v in names(nDat.group)[-1]){
nDat.woe = merge(nDat.woe,woe.all[v][[1]],by.x=v,by.y="BIN",all=F)
nDat.woe[,v]= nDat.woe$WOE
nDat.woe$WOE <- NULL
}
aa=sample_n(nDat.woe,20)
for(i in 1:10){
aa[,i]=round(aa[,i],4)
}
kable(aa, format = "markdown", align = "c")
```
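The per-bin WOE computed above, the log of the delinquent share over the on-time share, can be illustrated with a small self-contained Python sketch (toy data, not the R pipeline; both shares are floored to avoid log(0), in the spirit of the 0.001 floor applied in the R code):

```python
import math
from collections import Counter

def woe_table(bins, labels, floor=0.001):
    """WOE per bin: log( P(bin | y=1) / P(bin | y=0) ),
    matching the document's convention (delinquent = 1 in the numerator)."""
    n1 = sum(labels)
    n0 = len(labels) - n1
    cnt1 = Counter(b for b, y in zip(bins, labels) if y == 1)
    cnt0 = Counter(b for b, y in zip(bins, labels) if y == 0)
    woe = {}
    for b in sorted(set(bins)):
        p1 = max(cnt1[b] / n1, floor)  # share of delinquents in this bin
        p0 = max(cnt0[b] / n0, floor)  # share of on-time records in this bin
        woe[b] = math.log(p1 / p0)
    return woe

bins   = [1, 1, 2, 2, 2, 3, 3, 3]
labels = [0, 0, 0, 1, 1, 0, 1, 1]
print(woe_table(bins, labels))  # bin 1 strongly negative, bins 2 and 3 at log(2)
```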
Column
-----------------------------------------------------------------------
### WOE values for each variable
```{r fig.align="center"}
woe.df=data.frame()
for(i in 1:length(woe.all)){
temp=woe.all[[i]];temp$Feature=names(woe.all[i])
woe.df=rbind(woe.df,temp)
}
p <- ggplot(data = woe.df, aes(x=Feature,y=WOE,colour=BIN))+geom_point(size=3,alpha = 0.8)
p <- p + theme(axis.text.x = element_text(angle = 45)) + ggtitle("Vars with WOE")
ggplotly(p)
```
6. Score Card
=======================================================================
Column
-----------------------------------------------------------------------
### The standard scorecard model
```{r}
last.lm <- glm(Dlqin2yrs~. ,nDat.woe, family = "binomial")
```
1. The fitted logistic regression coefficients are:
`r coef(last.lm)`
2. Recall the general form of the logistic regression equation:
$$
\log(odds)=\log(\frac{p}{1-p})=\beta_0+\beta_1*x_1+\beta_2*x_2+...+\beta_9*x_9+\beta_{10}*x_{10}
$$
3. In our analysis, the corresponding equation is:
$$
\log(odds)=\log(\frac{p}{1-p})=-0.0290+0.6746*Dependents+0.5690*days60.89+0.6982*RealEstate+0.5904*days90+\\
0.2174*OpenCredit+0.3615*Income+1.0696*DebtRatio+0.6254*days30.59+0.3928*Age+0.6825*Utilization
$$
Note that each variable name above stands for that variable's WOE value, and $p=P_{(Dlqin2yrs=1)}$.
4. After the WOE transformation, the equation takes the form:
$$
\log(odds)=\log(\frac{p}{1-p})=\beta_0+\\
\beta_1*(w_{11}*\delta_{11}+w_{12}*\delta_{12}+...)+\\
\beta_2*(w_{21}*\delta_{21}+w_{22}*\delta_{22}+...)+...+\\
\beta_{10}*(w_{10,1}*\delta_{10,1}+w_{10,2}*\delta_{10,2}+...)
$$
5. The score assigned to the j-th level of variable $x_i$ is therefore $\beta_i*w_{ij}$. Each $\delta_{ij}$ is a binary indicator taking only the values 0 and 1; within the levels of a given $x_i$, a record has exactly one $\delta_{ij}$ equal to 1 (all others are 0), and $\delta_{ij}=1$ means the record falls into the j-th level of $x_i$. This equation is also known as the standard scorecard model.
Column {.tabset}
-----------------------------------------------------------------------
### Creating and implementing the scorecard
1. Scaling formula:
$${\rm Score} = A-B*\log(odds)$$
Common assumptions:
1. Default/normal odds of 1:15 correspond to a score of 600.
2. Every doubling of the odds deducts 20 points.
From the table below we obtain:
A=521.87, B=28.85.
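The values of A and B follow directly from the two assumptions; a quick numerical check (a minimal Python sketch, independent of the R code):

```python
import math

# Assumption 2: doubling the odds deducts 20 points.
# Score = A - B*log(odds), so 20 = B*log(2)  =>  B = 20/log(2)
B = 20 / math.log(2)            # ~28.8539

# Assumption 1: odds of 1:15 correspond to 600 points.
# 600 = A - B*log(1/15)  =>  A = 600 + B*log(1/15)
A = 600 + B * math.log(1 / 15)  # ~521.86

print(round(B, 2), round(A, 2))
```

Rounding B to 28.85 first, as the text does, gives A = 600 - 28.85*log(15) ≈ 521.87.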
```{r}
p=function(x)x/(x+1)
scales <- data.frame("Score" = 720-20*1:9,
                     "Odds (bad : good)" = paste("1 :",480/(2^(0:8))),
                     "Default probability p" = p((1/480)*(2^(0:8))),
                     check.names = FALSE)
kable(scales, format = "markdown", align = "c")
```
2. Combining the logistic regression equation with the scaling formula yields the score expression:
$$
{\rm Score} = A-B\beta_0-\\
B\beta_1*(w_{11}*\delta_{11}+w_{12}*\delta_{12}+...)-\\
B\beta_2*(w_{21}*\delta_{21}+w_{22}*\delta_{22}+...)-...-\\
B\beta_{10}*(w_{10,1}*\delta_{10,1}+w_{10,2}*\delta_{10,2}+...)
$$
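Putting the pieces together, scoring one record is a single linear pass over its bin WOEs. A hypothetical Python sketch; the three coefficients and WOE values below are illustrative placeholders, not a full fitted model:

```python
A, B = 521.87, 28.85   # scaling constants derived in the previous tab
beta0 = -0.0290        # intercept of the WOE-based logistic regression

def score_record(betas, woes):
    """Score = A - B*(beta0 + sum(beta_i * woe_ij)) over the record's bins."""
    log_odds = beta0 + sum(b * w for b, w in zip(betas, woes))
    return A - B * log_odds

# illustrative subset: three coefficients and the record's per-variable WOEs
betas = [0.6825, 0.3928, 0.6254]
woes  = [1.1283, -0.0513, -0.5232]
print(round(score_record(betas, woes), 1))
```

Because B multiplies the log-odds, each bin's contribution to the final score is exactly the -B*beta_i*w_ij term of the expression above.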
### Final scorecard
```{r}
beta <- data.frame(coef(last.lm))
beta$Feature=row.names(beta)
beta=beta[-1,]
row.names(beta) <- NULL; names(beta)[1] <- "beta"
woe.df.beta <- merge(woe.df, beta, by = "Feature", all = F)
woe.df.beta$Score = -28.85*woe.df.beta$WOE*woe.df.beta$beta
datatable(woe.df.beta)
```
### Application to the Shaoxing project
```{r}
sdry = strsplit(x = 'Age
Occupation
Marital status
Education level
Border entry/exit patterns
Prior drug offense
Other prior offenses
Abnormal check-in/check-out times
Abnormal internet-use hours
Stayed in the same hotel as drug-involved persons
Took the same flight as drug-involved persons
Stayed in the same hotel as other ex-offenders
Took the same flight as other ex-offenders
Local resident frequently staying in local hotels', split='\n')[[1]]
qcry = strsplit(x = 'Age
Occupation
Marital status
Education level
Prior offense record
Stayed in the same hotel as ex-offenders
Took the same flight as ex-offenders
Abnormal check-in/check-out times
Abnormal internet-use hours
Local resident frequently staying in local hotels
Number of past street police checks', split='\n')[[1]]
yqry=b
yqry[1]=paste0(yqry[1],"(y)")
kable(data.frame(`Bank loan delinquents` = c(yqry,'···','···','···','···','···'),
                 `Drug-related key persons` = c('Will use drugs within XX period (y)',sdry,'···'),
                 `Property-crime key persons` = c('Will commit property crime within XX period (y)',qcry,'···','···','···','···')))
```